COMP2420/COMP6420 – Introduction to Data Management,
Analysis and Security
Assignment – 2
Maximum Marks
100
Weight
20% of the Total Course Grade
Submission deadline
4:00PM, Friday, May 29th
Submission mode
Electronic, Using GitLab
Penalty
100% after the deadline
Learning Outcomes¶
The following learning outcomes apply to this piece:
• LO1 – Demonstrate a conceptual understanding of database systems and architecture, data models and declarative query languages
• LO2 – Define, query and manipulate a relational database
• LO3 – Demonstrate basic knowledge and understanding of descriptive and predictive data analysis methods, optimization and search, and knowledge representation.
• LO4 – Formulate and extract descriptive and predictive statistics from data
• LO5 – Analyse and interpret results from descriptive and predictive data analysis
• LO6 – Apply their knowledge to a given problem domain and articulate potential data analysis problems
• LO7 – Identify potential pitfalls, and social and ethical implications of data science
• LO8 – Explain key security concepts and the use of cryptographic techniques, digital signatures and PKI in security
Submission¶
You need to submit the following items:
• The notebook Assignment-2-uXXXXXXX.ipynb (where uXXXXXXX is your uid)
• A completed statement-of-originality.md, found in the root of the forked gitlab repo.
Submissions are performed by pushing to your forked GitLab assignment repository. For a refresher on forking and cloning repositories, please refer to Lab 1. Issues with your Git repo (with the exception of a CECS/ANU wide Gitlab failure) will not be considered as grounds for an extension. Any variation of this will result in a zero mark.
Notes:¶
• It is strongly advised to read the whole assignment before attempting it and have at least a cursory glance at the dataset in order to gauge the requirements and understand what you need to do as a bigger picture.
• Backup your assignment to your Gitlab repo often.
• Extra reading and research will be required. Make sure you include all references in your Statement of Originality. If this does not occur, at best marks will be deduced. Otherwise, academic misconduct processes will be followed.
• For answers requiring free form written text, use the designated cells denoted by YOUR WRITTEN ANSWER HERE — double click on the cell to write inside them.
• For all coding questions please write your code after the comment YOUR CODE HERE.
• In the process of testing your code, you can insert more cells or use print statements for debugging, but when submitting your file remember to remove these cells and calls respectively. You are welcome to add additional cells to the final submission, provided they add value to the overall piece.
• Your code answers will be marked on correctness and readability of your code, if your marker can’t understand your code your marks may be deducted.
• Your written answers will be marked on the correctness, depth and clarity of your written answers. If your marker cannot understand your answer, marks may be deducted
• Before submitting, restart the kernel in Jupiter Lab and re-run all cells before submitting your code. This will ensure the namespace has not kept any old variables, as these won’t come across in submission and your code will not run. Without this, you could lose a significant number of marks.
• For written responses, a word limit of 300 words per question applies. There is no minimum word count. After 300 words, the tutors will stop reading your response, marking what has already been said.
Introduction¶
This introduction has been split into three sections, based on the datasets you will be interacting with: CVE revisited, BikeStores, and Rumble.
CVE revisited¶
Sound familiar? We hope so. You are being provided another sample of the NVD CVE dataset from the first assignment, although for a different year. For a refresher, go check the about.md file.
BikeStores¶
To test your SQL muscles, we have provided a database (and the creation scripts) for a sample SQL database called BikeStores. The BikeStores database is modeled off a retail store selling bicycles. We have provided the Database diagram in Q3 for reference.
Rumble¶
One of the goals of this course was to introduce you to a multitude of different scenarios and datasets, and this assignment will be no exception.
The WWE Royal Rumble (Further Reading: Wikipedia or wwe.com) is an annual professional wrestling event put on by the World Wrestling Entertainment (WWE). At the event, the marque match is a 30-man (or now woman!) “battle royale”. (If none of this makes sense, don’t worry. We’re about to explain it.)
What is professional wrestling?¶
Professional Wrestling is a form of entertainment, quite popular in the United States and Japan. “Wrestlers” will “fight” in “matches”, which are aimed to combine theater and athleticism. Matches are predetermined, participants are all willing, and the main aim is to entertain the audience. (Further Reading).
So what is a “Royal Rumble”?¶
The “Royal Rumble” is a type of match that has 3 rules:
• Everyone gets allocated an entrance number (so, you start with 2 and new entries come in every so often).
• Only way to get rid of someone is to throw them out of the ring.
• Last person standing wins.
So, what has this got to do with the assignment?¶
Good question! All of the above is background information so you understand the dataset. In the interests of assessing your data handling, manipulation and SQL skills, you are going to be implementing an SQL database from the csv data files provided.
Data Description¶
You have three (3) datasets to work with in this assignment, broken down as follows:
• Question 1 – CVE
• Question 3 – BikeStores
• Question 4 – Rumble
Once again, the CVE dataset is a sizable dataset (roughly 8000 rows and 24 columns), so it is wise to consider your code in terms of complexity to ensure it doesn’t take 30 minutes to run a single line.
Further reading on the datasets can be found in the following locations:
• CVE about.md
• BikeStores about.md
• Rumble about.md
In [ ]:
# Code Imports
# Every Lab import is here, you may need to uncomment additional items as necessary.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sqlite3
from sqlite3 import Error
#from scipy import stats
#from sklearn.linear_model import LogisticRegression # Logistic Regression
#from sklearn.neighbors import KNeighborsClassifier # k-Nearest Neighbours
#from sklearn.preprocessing import LabelEncoder # encooding variables
#from sklearn.preprocessing import StandardScaler # encooding variables
#from sklearn.model_selection import train_test_split # testing our models
#from sklearn.preprocessing import OneHotEncoder # nominal variable
#from sklearn.metrics import confusion_matrix # scoring
#from sklearn.tree import DecisionTreeClassifier # decision trees
#from sklearn.tree import DecisionTreeRegressor # decision trees
#from sklearn import tree # decision trees
#from sklearn.decomposition import PCA # PCA
#from sklearn.cluster import KMeans # KMeans Clustering
In [ ]:
# Import additional modules here as required
# It is unlikely that you would need any additional modules, however we had added space here just in case you feel
# extras are required. Note that some justification as to WHY you are using them should be provided.
#
# Note that only modules in the standard Anaconda distribution are allowed. If you need to install it manually, it is not an accepted package.
#
#
Q1: Probing Products (25 marks)¶
Building off the dataset you initially encountered in Assignment 1, the CVE dataset has made a return for the following question. Note that while the dataset holds the same schema, the Base Scores have been included, and the data is from a different year (2018).
The following question is designed to get you to load and process data, and implement a clustering model for the given scenario below. You have been introduced to KMeans clustering in the lectures and labs and this would therefore be the assumed clustering method, although you are welcome to supplement this with other clustering methods from the sklearn package as you desire.
Your scenario is as follows:
Once again, you have assumed the role of a member of a cyber-security team interested in the vulnerability of products. One of your colleagues developed a heuristic that could be used to group CVE entries into relative risk groups that highlight which vulnerabilities need to be patched in products. While this is useful for determining which vulnerabilities should be patched for products that are already within the system, your software procurement team wants to know if there are any products to avoid when buying new products.
Your procurement team has asked you to develop an automated modelling system (ie: clustering model) to identify software that can be grouped based on their risk. You must decide how many groupings, and what risk level (for example, High, Medium or Low) is appropriate for each group. Risk could be based on the number of CVEs and CVSS v2 metrics that can be associated to a product. The procurement team does not wish to use the CVSS Base Score metric as a clustering metric. They are also only interested in products with more than 3 entries, otherwise they predict there will be bias in the results. Finally, they are only interested in using the CVSS v2 system.
Based on this clustering model, the procurement team will make risk-based decisions to determine whether it is safe to introduce a product to the system.
You will first be asked to import and pre-process the data ready to implement a clustering model. Then, you are on your own in the world of clustering. Good luck!
1.1 Preprocessing¶
To start off, bring in the data and get it ready for clustering. Your tasks are:
1. Import the Data. The dataset is available in the location data/cve/nvdcve18.csv [1 mark]
2. Prepare the data for a clustering task. You are welcome to use the data processing code that you wrote for the previous assignment.
Note: While not every part of this section is not directly assessed, you have a number of tasks that will aid in your clustering and data analysis in future questions.
In [ ]:
# YOUR CODE HERE
# (ANY ADDITIONAL CELLS AS REQUIRED)
1.2 Building a Clustering Model¶
Based on the scenario above, create a clustering model to group products within the CVE dataset by the risk they may pose. Ensure that:
1. You include a detailed visualisation of the clusters.
2. You choose an appropriate number of clusters.
You should consider whether normalisation and any other data processing steps explored in this course are appropriate for this task.
[10 marks]
In [ ]:
# YOUR CODE HERE
# (ANY ADDITIONAL CELLS AS REQUIRED)
1.3 Analysing the Clusters¶
With your clustering model complete, analyse the outputs in preparation for showing the results to the procurement team. Your tasks are as follows:
1.3.1 Comparison of scores¶
1. Compare the mean of each CVSS v2 metric, including the CVSS v2 Base Score, between all clusters in your model.
2. Identify and explain the largest differences in these metrics between the clusters (i.e. what are the defining characteristics of a cluster compared to other cluster(s)?). Ensure that you compare the average CVSS v2 Base Score between each cluster as well.
[5 marks]
In [ ]:
# YOUR CODE HERE
# (ANY ADDITIONAL CELLS AS REQUIRED)
# YOUR RESPONSE HERE
1.3.2 Comparison of scores¶
1. Visualise the distribution of the CVSS2 Score between each cluster.
2. Answer the following: Without the use of a hypothesis test, does the CVSS2 Score appear to differ significantly between these clusters? (You may reference the previous question rather than reproducing answers)
[4 marks]
In [ ]:
# YOUR CODE HERE
# (ANY ADDITIONAL CELLS AS REQUIRED)
# YOUR RESPONSE HERE
1.4 Convincing the Procurement Team¶
With your evidence and analysis points well defined, it is now time to present your model and findings to the procurement team.
Explain your model, your reasoning for using the particular CVSS metrics as described in the initial context and which you used in Q1.2, and how this model could be used when they are determining what software should be purchased. Include an example of a product in your explanation.
[5 marks]
# YOUR RESPONSE HERE
In [ ]:
# Code box if needed.
Q2: Eager for Ethics (10 marks)¶
While Data Scientists and Data Engineers spend a lot of timing thinking about how to solve a problem, it is important to think about why we solve a problem and what impacts it could have. For the following scenarios, provide a written response to the questions.
2.1: Scenario One¶
Blume, an organisation at the forefront of technology, has recently hired you as a Data Analyst. In your induction, you discover Blume works with a number of government and private industry organisations in data handling processes, while maintaining a copy of the data. This data is then collected, analysed and indiscriminately sold on to a number of other organisations who have used the data to perform a number of questionable practices. These practices include: adjusting prices of health insurance premiums based on a person’s eating habits, adjusting rent prices based on income, predictive policing, and selective advertising.
What are some of the potential ethical issues that could arise in Blume’s business practices? What harms could arise from the outcomes of Blume’s business dealings? Provide examples in your response to the questions.
[4 marks]
# YOUR RESPONSE HERE
2.2: Scenario One (cont.)¶
As a data analyst for the company, what should you be conscious of when working there and handling the sensitive data that you could be exposed to?
[3 marks]
# YOUR RESPONSE HERE
2.3: Scenario Two¶
You’ve now been working as a Data Analyst at Blume for over a year, and recently Blume has been featured in the news for unethical business practices. A whistle-blower organisation, DedSec, formally approaches you individually, offering to pay you for access to Blume’s data. DedSec claims that “we will be the watch dogs”, reviewing Blume’s practices from inside their systems and reporting any issues they find publicly.
Discuss the ethical issues with letting DedSec have access to Blume’s systems. Provide examples in your response.
[3 marks]
# YOUR RESPONSE HERE
Q3: Serious SQL (20 marks)¶
Consider the following scenario.
You are applying for a job as a database developer for an unnamed wrestling company. Part of the job description includes creating an automation system for running SQL queries. During the hiring process, the interviewers want to ensure you understand the SQL language. They have provided a set of questions to be answered by you, and your responses will later be reviewed by them. They are unwilling to give you access to their real database (which is mysteriously missing), so they have provided an SQLite3 database and asked you to interact with it using Python.
Based on the above scenario, you have been asked to answer a number of questions to test your skills. You will be using the BikeStores database for this question. The database model is as follows:

In the following questions, you will be asked to execute the SQL statement, and explain any reasoning as necessary. Data can be formatted as raw printed output or a Pandas DataFrame. Recall the use of the fetchone and fetchall functions on an sqlite cursor for retriving information.
In [ ]:
# THIS IS YOUR CONNECTION BLOCK, DO NOT MODIFY THIS.
# OTHERWISE, YOU WILL NOT BE ABLE TO READ THE DATABASE
def create_connection(db_file):
“”” Connect to the specified SQLite database, if not exist, create a new one (in memory);
:db_file: location of db to connect to
:return: Connection object or None
“””
conn = None
try:
conn = sqlite3.connect(db_file)
print(“Connection established!”)
except Error as e:
print(“Error Connecting to Database”)
raise(e)
return conn
dbfile_prod = “./data/bikestores/production.db”
dbfile_sales = “./data/bikestores/sales.db”
conn = create_connection(dbfile_prod)
cur = conn.cursor()
cur.execute(‘attach database “‘ + dbfile_prod + ‘” as production;’)
cur.execute(‘attach database “‘ + dbfile_sales + ‘” as sales;’)
# remember to close the connection when everything is done
3.1: Customised Customers¶
List the number of customers in the database.
[2 marks]
In [ ]:
# Your Code Here
3.2: Avoid the Coast¶
List the customers who do not live in the state of California (CA) nor the state of New York (NY).
[2 marks]
In [ ]:
# Your Code Here
3.3: Short on Stock¶
List the amount of stock available for the Heller Shagamaw Frame – 2016 product in each store.
[2 marks]
In [ ]:
# Your Code Here
3.4: Lost Shipments¶
List the earliest and latest order(s) (by order date) for order(s) that have not been shipped (ie: have no shipping date.)
Note: This should be performed in a single SQL query.
[3 marks]
In [ ]:
# Your Code Here
3.5: Started off Simple¶
List the Order IDs and all of the product information of all products that Ollie Zimmerman has ordered.
Note: There may be more than 1 Order ID.
[3 marks]
In [ ]:
# Your Code Here
3.6: Staff Stuff¶
List the Staff ID and _firstname of each staff member, along with the number of orders processed by that staff member and the total sum of the price of orders processed by that staff member. Name the columns: Staff_ID, Staff_Name, Orders and Total_Price respectively.
Note: You may exclude staff that haven’t processed any orders.
[4 marks]
In [ ]:
# Your Code Here
3.7: Haro in 2018¶
List the first name, last name, and email of every customer who, in 2018, ordered any products manufactured by the Haro brand. Sort by _lastname then by first_name, with “A” at the top and “Z” at the bottom. Ensure there are no duplicate entries.
[4 marks]
In [ ]:
# Your Code Here
Q4: Dubious Databases (25 marks)¶
Consider the following scenario:
You’ve been successful in your application as a database developer for an unnamed wrestling company. Your managers wish to understand why it is so difficult to search through the results of past events, and to your surprise, you find all the results as .csv files on your computer. The previous database developer never bothered to make a database, and manually kept track of all the information of the events – making it very difficult for quick searches. You’ve been tasked with building a database for the company to use to fix this. Rather than starting from scratch, you’re hoping you can use the event results to start filling in information. The first folder you find is called rumble.
Your task is as follows:
• Design a database to meet the requirements specified below.
• Draw a simple ER diagram showing the relationship between the tables.
• Implement the database in an SQLite database.
• Import the csv files from the rumble folder and enter the data into the database.
• Complete the final search request as described by your manager below.
The following is the requirements brief that was provided to you by your manager in an email:
Afternoon newbie,
Our company runs a really special match called the Royal Rumble but so far, we haven’t been keeping digital records of it very well. We’ve been running it for a while, so a lot of it is paper-based, but the previous database developer had stored it all in text files which you should have access to.
So basically, the company is built on events, which we run once a year. An event would normally have a title, year, location, and attendance; although we may have to enter some of that information by ourselves later.
Each event has a series of matches. Initially we only ran one match at each event, but lately we’ve been doing one for males and one for females. A match has the following attributes:
• a title,
• list of participants,
• how long each respective participant was in the match,
• how many people they eliminated,
• one winner, and
• how long the match went for. Some other organisations rate their matches, although I don’t like that so don’t worry about that aspect.
Finally, the wrestlers or participants will have the following attributes:
• a stage name,
• real name, and
• gender.
We will have to enter a lot of this information later, but at least we can enter in the information we have. You can just assume the records contain the stage name, not the real name.
Don’t worry about formatting issues; for example, if a wrestler’s name is “Mankind” one year, then “Cactus Jack” another year, and “Dude Love” the year after, you can just pretend they’re different wrestlers. People change their names all the time. Similarly, if the previous database designer entered a wrestler as “D’Lo Brown” in one place and “DLo Brown” in another place, you can also pretend that’s two wrestlers. We’ll fix that up manually later.
After you’ve done all that, can you show me the match title and duration of all the events that Steve Austin won? I believe he goes by “Stone Cold Steve Austin” and “Steve Austin”.
Good luck! Adam Cole
Note: the data is unlikely in an ideal format, so you will need to explore and understand the data yourself before getting started. See the Rumble about.md for additional details (in a real-world situation, even this information would not be provided). We will only be able to clarify any errors or unreasonably ambiguous details. Remember to state all assumptions beforehand.
Note 2:_ Diagrams can be drawn by hand and scanned/photographed, or you can use a tool such as draw.io. You should include the diagram in the img folder, and import it into a markdown box in your notebook. You can use the command ![er.png](./img/er.png) where er.png is the name of your ER diagram file.
[25 marks]
# YOUR RESPONSE HERE
In [ ]:
# YOUR CODE HERE
# (ANY ADDITIONAL CELLS AS REQUIRED)
Q5: Shocking Security (20 marks)¶
To finish, please answer the following questions in the raw text boxes provided.
5.1: Secure Signing¶
Consider the following scenario:
Gizmo Technologies has outsourced some work to Tom and Alice, and needs to get them to jointly sign a contract for the work. Since the parties involved are located in different parts of the world, and this is a frequently occurring scenario, Gizmo Technologies decides to come up with a method for doing this electronically. The contract has to be signed by both Tom and Alice. We assume that the contract is transmitted electronically over public channels, so integrity and confidentiality have to be assured. Both Tom and Alice need to be assured that they are both signing the same contract and need to have a copy of the signed contract. The contract needs to be non-repudiable and the process has to be efficient.
Describe a method that Gizmo Technologies can use for this purpose that uses cryptographic techniques and meets the above requirements.
[16 marks]
# YOUR RESPONSE HERE
5.2: Resilient ito the future¶
As technology evolves, so does the need for secure cryptographic algorithms. With the introduction of quantum computing, the organisations around the world are preparing to migrate towards quantum resistant algorithms.
Briefly explain why there is a need for quantum-resistant cryptographic algorithms. Provide an example of a quantum-resistant cryptographic algorithm in your explanation.
[4 marks]
# YOUR RESPONSE HERE